Abstract

We consider the downlink of a cellular system and address the problem of multiuser
scheduling with partial channel information. In our setting, the channel of each user
is modeled by a three-state Markov chain. The scheduler indirectly estimates the
channel via accumulated Automatic Repeat Request (ARQ) feedback from the scheduled
users and uses this information in future scheduling decisions. Using a Partially
Observable Markov Decision Process (POMDP), we formulate a throughput maximization
problem that is an extension of our previous work where the channels were modeled
using two states. We recall the greedy policy that was shown to be optimal and easy to
implement in the two-state case and study the implementation structure of the greedy
policy in the considered downlink. We classify the system into two types based on the
channel statistics and obtain round robin structures for the greedy policy for each
system type. We obtain performance bounds for the downlink system using these
structures and study the conditions under which the greedy policy is optimal.

Index Terms: Markov channel, downlink, ARQ, multiuser scheduling, greedy policy.



1 Introduction 

Opportunistic multiuser scheduling, introduced by Knopp and Humblet in [1] and defined
as allocating the resources to the user experiencing the most favorable channel
conditions, has gained immense popularity among network designers in the recent past.
Opportunistic multiuser scheduling essentially taps the multiuser diversity in the
system and has motivated several researchers ([2], [3], [4], [5], [6]) to study the
performance gains obtained by opportunistic scheduling under various scenarios. While
the i.i.d. flat fading model is popularly used by researchers in modeling time-varying
channels, it fails to capture the memory in the channel observed in realistic
scenarios. This motivated researchers to consider the Gilbert-Elliott model [7], which
represents the channel by a two-state Markov chain. Specifically, a user experiences
error-free transmission when it observes a "good" channel, and unsuccessful
transmission in a "bad" channel. Several works have studied opportunistic multiuser
scheduling in this Markov-modeled channel ([8], [9], [10], [11], [12]).
It is understandable that the availability of the channel state information at the scheduler 
is crucial for the success of the opportunistic scheduling schemes. Traditionally, when 
the scheduler has no channel information, pilot based channel estimation is performed 
and the estimates are used for scheduling decisions ([21 El [13]). A new line of work, 
[11], [14], [15], [16], [17], attempts to exploit Automatic Repeat reQuest (ARQ)
feedback, traditionally used for error control at the data link layer, to estimate the
state of the two-state Markov-modeled channels.

In [16] and [18], for a two-state Markov-modeled downlink (one-to-many communication)
system, a greedy policy has been shown to be optimal from a sum throughput point of
view. This greedy policy is shown to be amenable to an easy implementation with a
simple round robin based algorithm that takes as input the ARQ feedback from the
scheduled user. Although modeling the channel by a two-state Markov chain is a welcome
change from the traditional memoryless models, the scheduler can do better by
discriminating the channel conditions on a finer level, i.e., if the channel is modeled
by Markov chains with more states. As a first step in this direction, in this report,
we model the channels by three-state Markov chains and study the properties of the
greedy policy and the conditions under which it is optimal.

The report is organized as follows. The problem setup is described in Section 2,
followed by a study of the implementation structure of the greedy policy in Section 3.
A comparison of the original system with the genie-aided system, introduced in [18], is
made in Section 4. In Section 5, upper and lower bounds to the system performance are
derived. We study the conditions under which the greedy policy is optimal in Section 6.
Conclusions are provided in Section 7.



2 Problem Setup 

2.1 Channel Model - Probability Transition Matrix 

We consider downlink transmissions with 2 users. The channel between the base station
and each user is modeled by an i.i.d., first-order, three-state Markov chain. Time is
slotted, and the channel of each user remains fixed for a slot and evolves into another
state in the next slot according to the Markov chain statistics. The time slots of all
users are synchronized. The three-state Markov channel is characterized by a
$3 \times 3$ probability transition matrix

$$P = \begin{bmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{bmatrix}, \tag{1}$$

where $p_{ij}$ is the probability of evolving from state $i$ to state $j$ in the next slot.

The states are made to represent the quantized strength of the channel, with state 1
assumed to represent the lower end of the channel strength spectrum and state 3
representing the higher end. We assume that the Markov chain is positively correlated
in time. Thus $p_{ii} > p_{ji}$ if $j \neq i$. Also, motivated by observation of
realistic channels, we assume that the channel evolves in a smooth fashion across time.
Thus $p_{21} > p_{31}$ and $p_{23} > p_{13}$. Also, observing that state 3 represents a
region of the channel strength spectrum that is not bounded from above, it is
reasonable to assume $p_{32} < p_{12}$. To summarize, the transition matrix elements
are related as below:

$$\begin{aligned} p_{11} &> p_{21} > p_{31} \\ p_{22} &> p_{12} > p_{32} \\ p_{33} &> p_{23} > p_{13} \end{aligned} \tag{2}$$
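The ordering in (2) is easy to check mechanically. The sketch below is only an illustration: the function name `satisfies_ordering`, the `strict` flag and the matrix `P_EXAMPLE` are assumptions made here, not part of the report. The inequalities are read as strict, as printed; `strict=False` also admits equalities such as those assumed later in Proposition 12.

```python
def satisfies_ordering(P, strict=True, tol=1e-12):
    """Return True if P is a 3x3 row-stochastic matrix obeying the ordering in (2)."""
    if len(P) != 3 or any(len(row) != 3 for row in P):
        return False
    # Each row must be a probability distribution.
    for row in P:
        if any(x < 0 for x in row) or abs(sum(row) - 1.0) > tol:
            return False
    # Strict comparison as printed in (2); strict=False also admits equalities.
    gt = (lambda a, b: a > b) if strict else (lambda a, b: a >= b)
    (p11, p12, p13), (p21, p22, p23), (p31, p32, p33) = P
    return (gt(p11, p21) and gt(p21, p31)
            and gt(p22, p12) and gt(p12, p32)
            and gt(p33, p23) and gt(p23, p13))


# Hypothetical transition matrix used only for illustration.
P_EXAMPLE = [[0.6, 0.3, 0.1],
             [0.3, 0.4, 0.3],
             [0.1, 0.2, 0.7]]
```

Calling `satisfies_ordering(P_EXAMPLE)` checks both the row sums and the column-wise ordering in one pass.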

2.2 Existence of Steady State 

Let $p_{ss}$ denote the steady state probability vector of the Markov chain, with
$p_{ss}(i)$ representing the steady state probability of state $i$. We now rule out the
instances of $P$ matrix entries that either 1) obviously lead to a steady state with
$p_{ss}(i) = 0$ for some $i \in \{1, 2, 3\}$ or 2) eliminate the possibility of a steady
state altogether. Both these cases trivialize the scheduling problem we address in this
report. From the inequalities governing the elements of the $P$ matrix, we see that
$p_{ii} > 0$. Otherwise, $p_{ji} = 0$ for all $j$, leading to $p_{ss}(i) = 0$. Also
$p_{12} \neq 0$, since if $p_{12} = 0$, then $p_{32} = 0$ and thus $p_{ss}(2) = 0$ (since
when the channel enters state 1 or 3 it will never reach state 2 again). For similar
reasons, $p_{21} \neq 0$ and $p_{23} \neq 0$. Thus the only elements that can be zero
are $p_{13}$, $p_{31}$ and $p_{32}$. Among these, $p_{31}$ and $p_{32}$ cannot both be
zero. Otherwise, $p_{33} = 1$, making state 3 an absorbing state and leading to
$p_{ss}(1) = p_{ss}(2) = 0$. Thus there can be at most one zero in any row and at most
one zero in any column of $P$. This, along with the fact that all the entries are
non-negative, renders $P^2$ a positive matrix (i.e., all the elements are positive).

From [19] (p. 51): a nonnegative square matrix $A$ is said to be regular iff there
exists a natural number $r$ such that $A^r$ is a positive matrix. Thus $P$ is a regular
matrix with $r \geq 2$.

We now reproduce Theorem 4.2 from [19].

Theorem 1. If $A$ is a regular stochastic matrix then $A^n$ converges as $n \to \infty$
to a positive stable stochastic matrix $e\Pi$, where
$\Pi = (\pi(i))_{i \in \text{state space}}$ is a probability vector with non-null
entries and $e$ is a unit vector with dimension equal to the cardinality of the state
space.

Thus the n-step transition probability matrices of the Markov channels in our problem 
also converge to stable stochastic matrices. Since this is necessary and sufficient for the 
existence of steady state, under the conditions established earlier, the Markov channels 
in our problem have steady state with the steady state probability vector given by p ss . 
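The regularity argument can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical matrix `P_ZEROS` that has zeros in the two corner entries the text allows to vanish; the helper names are illustrative, and power iteration is just one simple way to approximate $p_{ss}$.

```python
def mat_mul(A, B):
    """Plain-Python matrix product of A and B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def is_positive(M):
    """True if every entry of M is strictly positive."""
    return all(x > 0 for row in M for x in row)


def steady_state(P, max_iters=10_000, tol=1e-13):
    """Approximate p_ss by repeatedly applying pi <- pi P (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical matrix with p13 = p31 = 0; P itself is not positive, but P^2 is.
P_ZEROS = [[0.7, 0.3, 0.0],
           [0.3, 0.4, 0.3],
           [0.0, 0.2, 0.8]]
```

For this example, `is_positive(P_ZEROS)` is false while `is_positive(mat_mul(P_ZEROS, P_ZEROS))` is true, and `steady_state(P_ZEROS)` returns a vector with all entries positive.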

2.3 Scheduling Problem 

The base station is the central controller that controls the transmission to the users in 
each slot. In any time slot, the base station does not know the exact channel state of 
the users and it must schedule the transmission of the head-of-line packet of exactly
one user (a data queue is maintained for each user to collect the data meant for that
user). Thus, TDMA-style scheduling is performed here. The power spent in each
transmission is fixed. At the beginning of each time slot, the head-of-line packet
of the scheduled user is transmitted and is dropped from the queue. The scheduled
user, based on measurements of the signal strength of the received data packet, obtains
information on the state of the channel and sends this back to the scheduler. We denote
this feedback by $F_i$ with $i \in \{1, 2, 3\}$. This channel state feedback is assumed to be
transmitted over a dedicated error-free channel. This feedback information, along with 
the label of the slot in which it is acquired, will be used in future scheduling decisions. 
The performance metric that the base station aims to maximize is the sum reward of the 
system. Details will be discussed in the next section. 

2.4 Formal Problem Definition 

Since the base station must make scheduling decisions based on only a partial
observation of the underlying Markov chain, we employ techniques from partially
observable Markov decision process (POMDP) theory ([20], [21], [22], [23]) in this work.
We now proceed to introduce the terms/entities that we use in this work, many of which
are borrowed from the POMDP literature.

Control interval $k$: Each time slot in our problem setup will henceforth be called a
control interval. The "end" of the POMDP is fixed. A control interval is indexed by $k$
if there are $k - 1$ more intervals until the end of the process.

Action $a_k$: Indicates the index of the user (1 or 2) scheduled in control interval $k$.

Belief vector of user $i$ at the $k$th control interval, $\pi_{k,i}$: Element
$\pi_{k,i}(j)$ denotes the probability that the channel of user $i \in \{1, 2\}$ in the
$k$th control interval is in state $j \in \{1, 2, 3\}$, given all the past information
about that channel. If $F_j$ was received from user $i$, $l + 1$ control intervals
earlier with $l \in \{0, 1, 2, \ldots\}$, then the belief vector in the current interval
$k$ is given by $\pi_{k,i} = [p_{j1}\ p_{j2}\ p_{j3}] P^l$. We will henceforth represent
the vector $[p_{j1}\ p_{j2}\ p_{j3}]$ by $p_j$. If user $i$ is not scheduled in control
interval $k$, then the belief vector of this user evolves to the next interval as
follows: $\pi_{k-1,i} = \pi_{k,i} P$.

It has been proven in [20] that the belief vector $\pi_{k,i}$ is a sufficient statistic
for all information about the channel of user $i$ from the past control intervals, in
our case, the scheduling decisions and the channel feedback information from the past.
Thus the scheduling decision in any control interval can be based solely on the belief
vectors for that interval and not on the past channel feedback or schedule information.
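The belief update described above amounts to repeated right-multiplication by $P$. A minimal sketch, assuming hypothetical helper names (`propagate`, `belief_after_feedback`) and an illustrative matrix `P_EXAMPLE` that are not from the report:

```python
def propagate(belief, P):
    """One-interval evolution of an unscheduled user's belief: pi_{k-1} = pi_k P."""
    return [sum(belief[i] * P[i][j] for i in range(3)) for j in range(3)]


def belief_after_feedback(P, j, l):
    """Belief of a user whose feedback F_j arrived l + 1 intervals ago: p_j P^l."""
    belief = list(P[j - 1])          # p_j is row j of P (states are 1-indexed)
    for _ in range(l):
        belief = propagate(belief, P)
    return belief


# Hypothetical transition matrix used only for illustration.
P_EXAMPLE = [[0.6, 0.3, 0.1],
             [0.3, 0.4, 0.3],
             [0.1, 0.2, 0.7]]
```

With `l = 0` the belief is simply the row $p_j$; each further interval without scheduling applies `propagate` once more.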

Scheduling Policy $\mathfrak{A}_k$: A scheduling policy in the control interval $k$ is a
mapping from the belief vectors and the control interval index to an action as follows:

$$\mathfrak{A}_k : (\pi_{k,1}, \pi_{k,2}, k) \to a_k \quad \forall k \geq 1.$$

Note that the scheduling policy can, in general, be time-variant.

Reward Structure: In any control interval $k$, a reward of $\alpha_j$ is accrued when
the scheduled user sends back $F_j$. Let state 1 be defined such that no reward is
accrued when a user in state 1 is scheduled, i.e., $\alpha_1 = 0$. This assumption can be
satisfied by letting state 1 represent the channel strengths that do not allow any
useful data transfer. Since state 3 represents channel strengths that are better than
those represented by state 2, we have $\alpha_3 > \alpha_2$. Throughout this report we
will assume $\alpha_3 = 1$ without loss of generality.

Net Expected Reward in the control interval $m$, $V_m$: With the belief vectors
$\pi_{m,1}$, $\pi_{m,2}$ and the scheduling policy $\{\mathfrak{A}_k\}_{k \leq m}$ fixed,
the net expected reward, $V_m$, is the sum of the reward $R_m(\pi_{m,a_m}, a_m)$
expected in the current control interval $m$ and $E[V_{m-1}]$, the net reward expected
in the future control intervals conditioned on the belief vectors and the scheduling
decision in the current control interval. Formally,

$$V_m(\pi_{m,1}, \pi_{m,2}, \{\mathfrak{A}_k\}_{k \leq m}) = R_m(\pi_{m,a_m}, a_m) + E\left[V_{m-1}(\pi_{m-1,1}, \pi_{m-1,2}, \{\mathfrak{A}_k\}_{k \leq m-1}) \mid \pi_{m,1}, \pi_{m,2}, a_m\right],$$

where the expectation is over the belief vectors $\pi_{m-1,1}$, $\pi_{m-1,2}$.
With$^1$ $\alpha = [\alpha_1\ \alpha_2\ \alpha_3]^T$, the expected current reward can be
written as

$$R_m(\pi_{m,a_m}, a_m) = \pi_{m,a_m}\alpha.$$

Note that if user $a_m$ was observed to be in state $i$ in the previous interval, then
$\pi_{m,a_m} = p_i$ and $R_m(\pi_{m,a_m}, a_m) = p_i\alpha$.

Performance Metric, the Sum Reward $\eta_{sum}$: For a given scheduling policy
$\{\mathfrak{A}_k\}_{k \geq 1}$, the sum reward is given by

$$\eta_{sum}(\{\mathfrak{A}_k\}_{k \geq 1}) = \lim_{m \to \infty} \frac{V_m(p_{ss}, p_{ss}, \{\mathfrak{A}_k\}_{k \geq 1})}{m}, \tag{3}$$

where $p_{ss}$ is the steady state probability vector of the underlying Markov channels.

Optimal Scheduling Policy, $\{\mathfrak{A}^*_k\}_{k \geq 1}$:

$$\{\mathfrak{A}^*_k\}_{k \geq 1} = \arg\max_{\{\mathfrak{A}_k\}_{k \geq 1}} \eta_{sum}(\{\mathfrak{A}_k\}_{k \geq 1}). \tag{4}$$

3 Structure of the Greedy Policy 

Consider the following policy: 

$$\widehat{\mathfrak{A}}_k : (\pi_{k,1}, \pi_{k,2}, k) \to a_k = \arg\max_{a_k} R_k(\pi_{k,a_k}, a_k) = \arg\max_{i} \pi_{k,i}\alpha \quad \forall k \geq 1.$$

Since the above policy attempts to maximize the expected current reward, without any
regard to the expected future reward, it follows an approach that is fundamentally
greedy in nature. For this reason, we henceforth call
$\{\widehat{\mathfrak{A}}_k\}_{k \geq 1}$ the Greedy Policy. We now proceed to derive
the implementation structure of the greedy policy.

Lemma 2. For any $k \geq 0$, the immediate reward expected by scheduling a user that
was observed, $k + 1$ control intervals earlier, to be in state 2 lies between the
rewards corresponding to states 3 and 1, i.e.,

$$p_1 P^k \alpha < p_2 P^k \alpha < p_3 P^k \alpha, \quad \forall k \in \{0, 1, 2, \ldots\}. \tag{5}$$

$^1$ $x^T$ indicates the transpose of vector $x$.

Lemma 3. The immediate reward expected by scheduling a user that was observed, $k + 1$
control intervals earlier, to be in state 3 monotonically decreases to $p_{ss}\alpha$
as $k$ increases from $0$ to $\infty$, i.e.,

$$p_3 P^{k+1} \alpha < p_3 P^k \alpha, \quad \forall k \in \{0, 1, 2, \ldots\}, \qquad p_3 \lim_{k \to \infty} P^k \alpha = p_{ss}\alpha. \tag{6}$$

Note that $p_{ss}\alpha$ is the immediate reward expected when no past information about
the user is available or when the belief vector of the user equals the steady state
vector, $p_{ss}$.

Lemma 4. The immediate reward expected by scheduling a user that was observed, $k + 1$
control intervals earlier, to be in state 1 monotonically increases to $p_{ss}\alpha$
as $k$ increases from $0$ to $\infty$, i.e.,

$$p_1 P^{k+1} \alpha > p_1 P^k \alpha, \quad \forall k \in \{0, 1, 2, \ldots\}, \qquad p_1 \lim_{k \to \infty} P^k \alpha = p_{ss}\alpha. \tag{7}$$

Note that, from the above lemmas, we have

$$p_2 \lim_{k \to \infty} P^k \alpha = p_{ss}\alpha. \tag{8}$$

In all the above results, the immediate reward approaches $p_{ss}\alpha$ as the time
since the last observation of the user increases. This is because, in the underlying
first-order Markov chain, the dependency between the states in two control intervals
(memory) diminishes as the time gap between the control intervals increases. These
lemmas are instrumental in obtaining the algorithm for implementing the greedy policy,
which will be summarized soon. We first identify two types of system based on the
properties of the $P$ matrix and the reward values.

• Type I system: when $p_2\alpha > p_{ss}\alpha$

• Type II system: when $p_2\alpha < p_{ss}\alpha$

The implementation algorithm for the greedy policy significantly changes depending on 
the type of the system. 
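The type of a system can be determined by comparing the two scalar rewards in the definition. The sketch below is illustrative only: `system_type`, `steady_state` and `P_EXAMPLE` are names assumed here, the steady state is approximated by power iteration, and a tie (which the definition does not assign to either type) is arbitrarily reported as type II.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def steady_state(P, max_iters=10_000, tol=1e-13):
    """Approximate p_ss by repeatedly applying pi <- pi P (power iteration)."""
    pi = [1.0 / 3.0] * 3
    for _ in range(max_iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


def system_type(P, alpha):
    """Return "I" if p_2 alpha > p_ss alpha, otherwise "II" (ties counted as II)."""
    return "I" if dot(P[1], alpha) > dot(steady_state(P), alpha) else "II"


# Hypothetical transition matrix used only for illustration.
P_EXAMPLE = [[0.6, 0.3, 0.1],
             [0.3, 0.4, 0.3],
             [0.1, 0.2, 0.7]]
```

For the same hypothetical matrix, the type depends on the reward vector: a mid-state reward close to $\alpha_3$ pushes the system toward type I, while a smaller one gives type II.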

Proposition 5. When the system is type I, the greedy policy is implemented as follows:

• If feedback $F_3$ or $F_2$ was received from the user scheduled in the previous
control interval (identified as user $s$), reschedule the user in the current control
interval.

• Schedule the other user (identified as user $u$) if feedback $F_1$ was received.

Proof. Referring to Fig. 1, when $F_3$ was received from user $s$, the expected reward
if $s$ is scheduled again is given by $p_3\alpha$. The expected reward if $u$ is
scheduled is a point on one of the three curves (for $k > 0$) in the figure. Note that
$p_3\alpha$ is greater than any point (the y-dimension) on any of the curves, thus
establishing the 'retain the schedule if $F_3$ is received' policy. This result
essentially stems from the following facts: 1) A higher reward ($\alpha_3 = 1$) is
accrued when the scheduled user happens to be in state 3 than in other states. 2) The
Markov channel is positively correlated in time ($p_{ii} > p_{ji}$ if $i \neq j$).

Figure 1: Type I system

Similarly, when $F_1$ was received from user $s$, the expected reward if $s$ is
scheduled again is given by $p_1\alpha$, which is less than any other point on the three
curves, thus establishing the 'switch if $F_1$ is received' policy.

When $F_2$ is received, assuming the greedy policy was implemented so far since the
beginning of the scheduling process, the reward expected if $u$ is scheduled lies on
the lower curve $p_1 P^k \alpha$ for $k > 0$. This is because the first time (since the
beginning of the scheduling process) an $F_2$ is received (call this interval $m_0$),
if the greedy policy was implemented so far, user $u$ (the waiting user) would not have
given $F_3$ when it was dropped, and since this is the first time $F_2$ is observed by
the scheduler, $u$ would not have sent $F_2$ either when it was dropped. Therefore $u$
must have sent $F_1$ the last time it was scheduled (and hence dropped). Thus the
reward expected if $u$ is scheduled now (at $m_0$) falls on the bottom curve, leading to
retaining user $s$ (since $p_2\alpha > p_1 P^k \alpha$ for any $k > 0$). In the next
instance of $F_2$ reception, the same logic holds (as long as the greedy policy is
implemented all along until this instance), and so on for subsequent instances of
$F_2$. Note that the condition that greedy must be implemented since the beginning until
'now' is quite natural given our interest in implementing the policy in the current
interval. Thus there is no loss of generality here.

These arguments establish the proposition. □ 

Proposition 6. When the system is type II, the greedy policy is implemented as follows:

• If feedback $F_3$ was received from the user scheduled in the previous control
interval (call it user $s$), reschedule the user in the current control interval.

• If feedback $F_1$ was received, schedule the other user.

• If feedback $F_2$ was received, calculate the expected immediate reward if the other
user (identified as user $u$) is scheduled in the current interval (identified as $m$)
as follows: $\pi_{m,u}\alpha$, where $\pi_{m,u}$ is the belief vector of user $u$ in the
current control interval $m$. Now, schedule user $s$ if $p_2\alpha > \pi_{m,u}\alpha$.
Otherwise, schedule user $u$.

Proof. Refer to Fig. 2. The arguments for $F_3$ and $F_1$ are the same as in the
previous case.

Figure 2: Type II system

When $F_2$ is received, as seen from Fig. 2, the waiting user $u$ could have an expected
reward greater than that of $s$ if $u$ had been dropped due to $F_1$ at least $k_0$
intervals earlier or if $p_2 P^k \alpha$ does not monotonically increase to
$p_{ss}\alpha$ (Fig. 2 shows such a situation). Thus it is necessary to explicitly
calculate the expected reward of user $u$ before making a greedy decision. □
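The decision rules of Propositions 5 and 6 can be collected into one function. This is a sketch under stated assumptions, not the report's implementation: the name `greedy_action`, its argument layout and the matrix `P_EXAMPLE` are hypothetical, and `belief_other` is taken to be the waiting user's belief vector for the current interval.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def greedy_action(prev_user, feedback, belief_other, P, alpha, type_one):
    """Pick the user (1 or 2) to schedule after `prev_user` returned F_feedback."""
    other = 2 if prev_user == 1 else 1
    if feedback == 3:                  # best state observed: retain the schedule
        return prev_user
    if feedback == 1:                  # worst state observed: switch
        return other
    if type_one:                       # F_2 in a type I system: retain
        return prev_user
    # F_2 in a type II system: compare p_2 alpha with the waiting user's reward.
    stay_reward = dot(P[1], alpha)
    switch_reward = dot(belief_other, alpha)
    return prev_user if stay_reward > switch_reward else other


# Hypothetical transition matrix used only for illustration.
P_EXAMPLE = [[0.6, 0.3, 0.1],
             [0.3, 0.4, 0.3],
             [0.1, 0.2, 0.7]]
```

Only the type II branch after $F_2$ needs the waiting user's belief; in every other case the feedback alone determines the action, which is what gives the policy its round robin flavor.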

Note that the results in Lemma 2 and hence the implementation structure of the greedy
policy in Propositions 5 and 6 hold even when $\alpha_1 > 0$, as long as
$\alpha_1 < \alpha_2 < \alpha_3$.

4 Comparison with the Genie aided system 

In the two user, two state case, if the user scheduled (user s) in the previous control 
interval was observed to be in the best state, the scheduler retains the schedule (and 
hence accrues the best possible reward) since there is nothing more to gain by scheduling 
to the other user, while a loss is possible if the other user was in the worst state. Similarly, 
if user s was observed to be in the worst state, the scheduler switches to the other, since 
there is nothing more to lose by scheduling to the other user (as compared to scheduling 
s again), while a gain is possible if the other user was in the best state. Thus the two 
user, two state system is equivalent, in performance, to a genie-aided system where the 
scheduler learns about the states of both the users at the end of every interval.

This equivalence does not hold in the three-state system. The "nothing more to gain"
argument works when $s$ was observed to be in state 3, and the "nothing more to lose"
argument works when $s$ was observed to be in state 1. However, when $s$ was observed
to be in state 2, i.e., when $F_2$ was received, by scheduling the other user (user
$u$), the scheduler may either gain (if $u$ was in state 3) or lose (if $u$ was in state
1) as compared to when it schedules $s$ again. Thus, with information about the state of
the other user, there is definitely room for improvement. Thus the three-state (in
general, more than two states) system is not equivalent to the genie-aided system. Note
that the genie-aided system can be redefined as follows: the scheduler learns about the
state of both the users iff $s$ was observed in state 2. We see from the discussion so
far that this modified definition does not impart any performance loss in the
genie-aided system.

From the preceding discussion, it can be seen that the original three-state system
approaches the genie-aided system under any of the following cases:

• $p_2\alpha = p_3\alpha$. Thus on receiving $F_2$ from user $s$, nothing more can be
gained by scheduling the other user $u$ (while a loss is possible on switching). Hence,
$s$ is rescheduled. Thus there is no need to learn the previous control interval state
of $u$.

• $p_2\alpha = p_1\alpha$. Thus on receiving $F_2$ from user $s$, nothing more can be
lost by scheduling the other user $u$ (while a gain is possible on switching). Hence,
$u$ is scheduled. Again, there is no need to learn the previous control interval state
of $u$.

With mathematical analysis, it can be seen that case 1 is possible iff
$\alpha_2 = \alpha_3$ and $p_{21} = p_{31}$, while case 2 is possible iff
$\alpha_2 = \alpha_3$ and $p_{11} = p_{21}$. When the first set of conditions is
satisfied, it can be seen that states 2 and 3 can be merged at a very generic level
(not specific to the type of information used for scheduling), with the reduced
transition matrix given as below:

$$\begin{bmatrix} p_{11} & p_{12} + p_{13} \\ p_{21} & p_{22} + p_{23} \end{bmatrix}, \tag{9}$$

where row 1 and column 1 correspond to state 1 and row 2 and column 2 correspond to the
merged state. Thus the channel is effectively modeled by a two-state Markov chain,
explaining the equivalence with the genie-aided system.

However, it is interesting to note that, when the second set of conditions is
satisfied, such a merger is not possible between states 1 and 2 since we still have
$p_{13} < p_{23}$, making them different in their relationship with state 3. However,
in the context of the ARQ based scheduling problem (specifically case 2 in the
preceding discussion), they are synonymous and render the original system equivalent to
the genie-aided system.

5 Bounds On the System Sum Reward Capacity

Proposition 7. For the type I system, a lower bound to the sum reward capacity,
$S_{LB_I}$, is given as

$$S_{LB_I} > p_2\alpha - p_{ss}(1)(p_2\alpha - p_1\alpha), \tag{10}$$

where $p_{ss}(1)$ is the steady state probability that the state of the user is 1.

This bound is obtained by replacing the expected reward given $F_3$, i.e., $p_3\alpha$,
with $p_2\alpha$ in the sum reward evaluation of the greedy policy. Thus this is in fact
a lower bound to the greedy policy. Note that $S_{LB_I}$ decreases as the steady state
probability of the less rewarding state 1 ($p_{ss}(1)$) increases. Also notice that as
$p_1\alpha \to p_2\alpha$, $S_{LB_I} \to p_2\alpha$. This is expected in light of the
approach we used in obtaining $S_{LB_I}$, since the only reward that we accrue in any
control interval is now $p_2\alpha$. Also, the bound approaches the system sum reward
capacity when states 2 and 3 become increasingly synonymous. This happens as
$\alpha_2 \to \alpha_3$ and $p_{31} \to p_{21}$. The last statement comes from our
discussion in the previous section on the equivalence with the genie-aided system.

Proposition 8. For the type II system, a lower bound to the sum reward capacity is
given as

$$S_{LB_{II}} = (2p_{ss}(3) - p_{ss}^2(3))\, p_3\alpha + (1 - p_{ss}(3))^2\, p_1\alpha. \tag{11}$$

The proof proceeds as follows: In any control interval, the expected immediate reward
after a feedback $F_2$ is received in the previous interval is replaced by the reward
that would be expected if the other user (not scheduled in the previous interval) were
scheduled. Note that, by the implementation structure of the greedy policy, this latter
reward is $\leq$ the reward corresponding to the greedy choice$^2$. Next we replace
$p_2\alpha$ with $p_1\alpha$, giving the sum reward capacity lower bound.

Note that $S_{LB_{II}}$ is the same as the sum reward of a two-user system that accrues
reward $p_3\alpha$ if at least one of the users is in state 3 and reward $p_1\alpha$ if
neither of them is in state 3. This interpretation is strikingly similar to the
interpretation we made in the two-state, two-user problem in our preliminary research.
However, note that the present interpretation does not extend to the case when the
states of both users are available. For instance, if neither user is in state 3 and at
least one of them is in state 2, then, ideally, if the states of both the users are
known, a reward of $p_2\alpha$ must be accrued instead of $p_1\alpha$. This demonstrates
the loss in performance due to lack of knowledge of both user states, thus
differentiating the 3-state system from the 2-state system.

Proposition 9. An upper bound to the system sum reward capacity is given as

$$S_{UB} = (2p_{ss}(3) - p_{ss}^2(3))\, p_3\alpha + (2p_{ss}(1)p_{ss}(2) + p_{ss}^2(2))\, p_2\alpha + p_{ss}^2(1)\, p_1\alpha. \tag{12}$$

The bound is actually the sum reward capacity of the genie-aided system. Here, if at
least one of the users was in state 3 in the previous interval, the greedy policy
schedules that user and accrues a reward $p_3\alpha$. If neither user was in state 3
but at least one of them was in state 2, that user is scheduled and a reward of
$p_2\alpha$ is accrued. If both users were in state 1, a reward of $p_1\alpha$ is
accrued.
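The three bound expressions are simple functions of $P$, $\alpha$ and $p_{ss}$. The sketch below evaluates them for a hypothetical example; the function name `sum_reward_bounds`, the dictionary keys and the numbers are assumptions made here. It reads the last term of the type II lower bound as $p_1\alpha$, following the interpretation given after Proposition 8, and evaluates only the right-hand side of the type I expression in (10).

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def sum_reward_bounds(P, alpha, p_ss):
    """Evaluate the right-hand sides of (10), (11) and (12) for given P, alpha, p_ss."""
    r1, r2, r3 = (dot(row, alpha) for row in P)   # p_1 alpha, p_2 alpha, p_3 alpha
    q1, q2, q3 = p_ss
    any_in_3 = 2 * q3 - q3 ** 2                   # at least one user in state 3
    best_is_2 = 2 * q1 * q2 + q2 ** 2             # none in state 3, at least one in 2
    both_in_1 = q1 ** 2                           # both users in state 1
    return {
        "LB_I": r2 - q1 * (r2 - r1),              # meaningful for type I systems only
        "LB_II": any_in_3 * r3 + (1 - q3) ** 2 * r1,  # type II systems only
        "UB": any_in_3 * r3 + best_is_2 * r2 + both_in_1 * r1,
    }


# Hypothetical example: transition matrix, rewards and its exact steady state vector.
P_EXAMPLE = [[0.6, 0.3, 0.1],
             [0.3, 0.4, 0.3],
             [0.1, 0.2, 0.7]]
ALPHA = [0.0, 0.5, 1.0]
P_SS = [12.0 / 38.0, 11.0 / 38.0, 15.0 / 38.0]
BOUNDS = sum_reward_bounds(P_EXAMPLE, ALPHA, P_SS)
```

The gap between `UB` and `LB_II` equals the weight of the "no user in state 3, at least one in state 2" event times $(p_2\alpha - p_1\alpha)$, which matches the performance loss discussed after Proposition 8.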

6 On the Optimality of the Greedy Policy 

We proceed by introducing the following properties of the P matrix. 

$^2$ The replacement is only with respect to the accrued reward in the sum reward
expression, while the actual schedule decision is always maintained as greedy, so as
not to disturb the initial conditions of the problem for the future intervals.

Lemma 10. When $p_2 P [0\ 0\ 1]^T < p_{23}$ (condition (A)), then
$p_2 P^{k+1} [0\ 0\ 1]^T < p_2 P^k [0\ 0\ 1]^T$ $\forall k \geq 0$. Also, the steady
state element $p_{ss}(3) < p_{23}$ and $p_2 P^k [0\ 0\ 1]^T$ monotonically decreases to
$p_{ss}(3)$ as $k \to \infty$. (A) is also a necessary condition for the preceding
statement to hold.

Lemma 11. Under (A) from the previous lemma, $p_1 P^k [0\ 0\ 1]^T$ monotonically
increases to $p_{ss}(3)$ as $k \to \infty$, i.e.,
$p_1 P^{k+1} [0\ 0\ 1]^T > p_1 P^k [0\ 0\ 1]^T$ $\forall k \in \{0, 1, 2, \ldots\}$ and
$p_1 \lim_{k \to \infty} P^k [0\ 0\ 1]^T = p_{ss}(3) < p_{23}$.

This result can be obtained by replacing $\alpha$ in Lemma 4 by $[0\ 0\ 1]^T$, and using
$p_{ss}(3) < p_{23}$ from Lemma 10.
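Condition (A) and the two monotone sequences in Lemmas 10 and 11 can be inspected numerically. The following is a minimal sketch, assuming hypothetical helper names and two illustrative matrices: `P_A` (which has equal middle-column entries, as in Proposition 12) satisfies (A), while `P_NOT_A` does not.

```python
def propagate(belief, P):
    """Right-multiply a belief (row vector) by P."""
    return [sum(belief[i] * P[i][j] for i in range(3)) for j in range(3)]


def condition_a(P):
    """Condition (A): p_2 P [0 0 1]^T < p_23."""
    return propagate(P[1], P)[2] < P[1][2]


def state3_sequence(P, j, steps):
    """Return [p_j P^k [0 0 1]^T for k = 0, ..., steps - 1]."""
    belief, seq = list(P[j - 1]), []
    for _ in range(steps):
        seq.append(belief[2])        # probability of state 3 under this belief
        belief = propagate(belief, P)
    return seq


# Hypothetical matrices used only for illustration.
P_A = [[0.6, 0.3, 0.1],
       [0.3, 0.3, 0.4],
       [0.2, 0.3, 0.5]]
P_NOT_A = [[0.6, 0.3, 0.1],
           [0.3, 0.4, 0.3],
           [0.1, 0.2, 0.7]]
```

For `P_A`, `state3_sequence(P_A, 2, n)` decreases and `state3_sequence(P_A, 1, n)` increases toward the same limit, which is the behavior the two lemmas describe.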

Proposition 12. When $p_{12} = p_{22} = p_{32}$ and $p_{23}p_{31} > p_{21}p_{13}$, the
greedy policy is optimal among the policies that retain the schedule when feedback
$F_3$ is received.

Conjecture 13. When $p_{12} = p_{22} = p_{32}$ and $p_{23}p_{31} > p_{21}p_{13}$, the
greedy policy is globally optimal.

The premise behind our conjecture is that, in light of the positive correlation
property of the Markov chain, there is no obvious reason why the globally optimal
policy would reject a user that was in the best state possible in the previous control
interval. Thus it appears that the globally optimal policy belongs to the class that
retains the schedule when $F_3$ is received, suggesting that it is indeed the greedy
policy itself.

7 Conclusion 

We have considered the problem of scheduling under partial channel state information 
assumption in a Markov-modeled two-user downlink system with a channel state feedback 
provision. We classified the system in two types based on the transition probability 
matrix of the Markov chains and the reward structure. For each type, we established the 
implementation structure of the greedy policy. For the type I system, we showed that 
the greedy policy can be implemented via a simple round robin algorithm as was seen 
in our earlier work for the two-state Markov model. We studied the conditions under 
which the original system simplifies to the genie aided system and provided insights 
on these conditions. By appropriately bounding the immediate reward accrued in any 
control interval, we obtained bounds to the sum reward capacity of the system. Under 
some conditions on the P matrix, by restricting the search space to a specific class of 
schedulers, we showed that the greedy policy is 'constrained search space' optimal and 
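conjectured, with reasons, that the greedy policy is globally optimal as well.

A Proof of Lemma 2

Let $\beta = [\beta_1\ \beta_2\ \beta_3]^T$, with $\beta_1 < \beta_2 < \beta_3$.
Consider the inequality $p_3\beta > p_2\beta$. This can be rewritten as

$$\begin{aligned}
\beta_1 p_{31} + \beta_2 p_{32} + \beta_3 p_{33} &> \beta_1 p_{21} + \beta_2 p_{22} + \beta_3 p_{23} \\
\Leftrightarrow\ \beta_1 (p_{31} - p_{21}) &> \beta_2 (p_{22} - p_{32}) + \beta_3 (p_{23} - p_{33}) \\
\Leftrightarrow\ \beta_1 (p_{21} - p_{31}) &< -\beta_2 (p_{22} - p_{32}) + \beta_3 (p_{33} - p_{23}).
\end{aligned} \tag{13}$$

Since $\beta_2 > \beta_1$, it is now sufficient to prove
$\beta_2 (p_{21} - p_{31} + p_{22} - p_{32}) < \beta_3 (p_{33} - p_{23})$, i.e.,
$\beta_2 (p_{33} - p_{23}) < \beta_3 (p_{33} - p_{23})$, which is indeed true. Consider
the inequality $p_2\beta > p_1\beta$:

$$\begin{aligned}
\beta_1 p_{21} + \beta_2 p_{22} + \beta_3 p_{23} &> \beta_1 p_{11} + \beta_2 p_{12} + \beta_3 p_{13} \\
\Leftrightarrow\ \beta_2 + p_{23}(\beta_3 - \beta_2) - p_{21}(\beta_2 - \beta_1) &> \beta_2 + p_{13}(\beta_3 - \beta_2) - p_{11}(\beta_2 - \beta_1).
\end{aligned} \tag{14}$$

The last inequality is indeed true, since $p_{23} > p_{13}$, $p_{21} < p_{11}$ and
$\beta_3 > \beta_2 > \beta_1$. Thus if $\beta_1 < \beta_2 < \beta_3$ and
$\beta = [\beta_1\ \beta_2\ \beta_3]^T$,

$$p_3\beta > p_2\beta > p_1\beta. \tag{15}$$

We can write, for $i \in \{1, 2, 3\}$,
$p_i P^{k+1}\alpha = p_i [p_1 P^k \alpha\ \ p_2 P^k \alpha\ \ p_3 P^k \alpha]^T$. Thus if
$p_1 P^k \alpha < p_2 P^k \alpha < p_3 P^k \alpha$, we have, using (15),
$p_1 P^{k+1}\alpha < p_2 P^{k+1}\alpha < p_3 P^{k+1}\alpha$. Since
$\alpha_1 = 0 < \alpha_2 < \alpha_3 = 1$, the lemma is established using induction.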



B Proof of Lemma 3 and Lemma 4



Consider $p_3 P^{k+1}\alpha = p_{31}\, p_1 P^k \alpha + p_{32}\, p_2 P^k \alpha + p_{33}\, p_3 P^k \alpha$.
Since $p_1 P^k \alpha < p_2 P^k \alpha < p_3 P^k \alpha$ from Lemma 2, we have
$p_3 P^{k+1}\alpha < p_3 P^k \alpha$. Lemma 4 can be proved similarly.



C Proof of Lemma 10

Let $p_2 P^k [0\ 0\ 1]^T < p_2 P^{k-1} [0\ 0\ 1]^T$. Multiplying both sides by $p_{22}$
and adding to both sides
$p_{21}\, p_1 P^{k-1} [0\ 0\ 1]^T + p_{23}\, p_3 P^{k-1} [0\ 0\ 1]^T$,

$$p_{21}\, p_1 P^{k-1} [0\ 0\ 1]^T + p_{22}\, p_2 P^{k} [0\ 0\ 1]^T + p_{23}\, p_3 P^{k-1} [0\ 0\ 1]^T < p_2 P^k [0\ 0\ 1]^T. \tag{16}$$

If we show that
$p_{21}\, p_1 P^{k} [0\ 0\ 1]^T + p_{23}\, p_3 P^{k} [0\ 0\ 1]^T < p_{21}\, p_1 P^{k-1} [0\ 0\ 1]^T + p_{23}\, p_3 P^{k-1} [0\ 0\ 1]^T$,
then using (16),
$p_{21}\, p_1 P^{k} [0\ 0\ 1]^T + p_{22}\, p_2 P^{k} [0\ 0\ 1]^T + p_{23}\, p_3 P^{k} [0\ 0\ 1]^T < p_2 P^k [0\ 0\ 1]^T$,
i.e., $p_2 P^{k+1} [0\ 0\ 1]^T < p_2 P^k [0\ 0\ 1]^T$. Consider the inequality

$$\begin{aligned}
p_{21}\, p_1 P^{k} [0\ 0\ 1]^T + p_{23}\, p_3 P^{k} [0\ 0\ 1]^T &< p_{21}\, p_1 P^{k-1} [0\ 0\ 1]^T + p_{23}\, p_3 P^{k-1} [0\ 0\ 1]^T \\
\Leftrightarrow\ p_2 P^{k+1} [0\ 0\ 1]^T - p_{22}\, p_2 P^{k} [0\ 0\ 1]^T &< p_2 P^{k} [0\ 0\ 1]^T - p_{22}\, p_2 P^{k-1} [0\ 0\ 1]^T \\
\Leftrightarrow\ p_2 (P^{k+1} [0\ 0\ 1]^T - P^{k} [0\ 0\ 1]^T) &< p_{22} (p_2 P^{k} [0\ 0\ 1]^T - p_2 P^{k-1} [0\ 0\ 1]^T) \\
p_2 P^{k+1} [0\ 0\ 1]^T &< p_2 P^{k} [0\ 0\ 1]^T
\end{aligned} \tag{17}$$

where the last inequality is from the initial assumption that
$p_2 P^{k} [0\ 0\ 1]^T - p_2 P^{k-1} [0\ 0\ 1]^T < 0$.

With $p_2 P^1 [0\ 0\ 1]^T < p_2 P^0 [0\ 0\ 1]^T$, i.e., $p_2 P [0\ 0\ 1]^T < p_{23}$,
using induction, we have $p_2 P^{k+1} [0\ 0\ 1]^T < p_2 P^{k} [0\ 0\ 1]^T$
$\forall k \geq 0$. Since steady state exists, by the definition of steady state,

$$\lim_{k \to \infty} P^k = \begin{bmatrix} p_{ss} \\ p_{ss} \\ p_{ss} \end{bmatrix}.$$

Thus $p_2 \lim_{k \to \infty} P^k [0\ 0\ 1]^T = p_{ss}(3)$ and $p_{ss}(3) < p_{23}$ by
the monotonic decrease property of $p_2 P^k [0\ 0\ 1]^T$. Also note that the direction
of the inequalities throughout this proof can be changed and we can prove that
$p_2 P^k [0\ 0\ 1]^T$ monotonically increases to $p_{ss}(3)$ as $k \to \infty$ if
$p_2 P [0\ 0\ 1]^T > p_{23}$. This establishes that $p_2 P [0\ 0\ 1]^T < p_{23}$ is a
necessary condition for the first part of the Lemma to hold.



D Proof of Proposition [121 

Let the probability transition matrix satisfy the following conditions: 

$$p_{12} = p_{22} = p_{32} \tag{18}$$

$$p_{23}\, p_{31} > p_{21}\, p_{13} \tag{19}$$

The preceding inequality along with condition (18) is equivalent to condition (A) in Lemma 10. Thus under (18) and (19), both Lemma 10 and Lemma 11 hold true. From Lemma 10, $p_{23} > p_{ss}(3)$. From (18), $p_{ss}(2) = p_{22}$. Thus $p_2 \alpha - p_{ss} \alpha = p_{22}\alpha_2 + p_{23} - p_{ss}(2)\alpha_2 - p_{ss}(3) = p_{23} - p_{ss}(3) > 0$. The system is thus type I.
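As a sanity check on the type I argument, the sketch below evaluates conditions (18) and (19) and the gap $p_2\alpha - p_{ss}\alpha$ for one example. The matrix `P` and the value $\alpha_2 = 0.5$ are hypothetical, and the reward vector is assumed to have the form $\alpha = [0\ \alpha_2\ 1]^T$; the steady state is approximated by repeated multiplication rather than solved exactly.

```python
# Hypothetical transition matrix and reward vector (alpha_2 = 0.5 is assumed).
P = [[0.6, 0.3, 0.1],
     [0.3, 0.3, 0.4],
     [0.1, 0.3, 0.6]]
ALPHA = [0.0, 0.5, 1.0]


def vec_mat(v, M):
    """Multiply a row vector v by a matrix M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def steady_state(M, iters=200):
    """Approximate p_ss by propagating a uniform belief through M."""
    v = [1.0 / 3.0] * 3
    for _ in range(iters):
        v = vec_mat(v, M)
    return v


p2 = P[1]
p_ss = steady_state(P)

cond18 = P[0][1] == P[1][1] == P[2][1]                 # p_12 = p_22 = p_32
cond19 = P[1][2] * P[2][0] > P[1][0] * P[0][2]         # p_23 p_31 > p_21 p_13
gap = dot(p2, ALPHA) - dot(p_ss, ALPHA)                # p_2 alpha - p_ss alpha

print(cond18, cond19, [round(x, 4) for x in p_ss], round(gap, 4))
```

Under (18) the gap does not depend on the assumed $\alpha_2$, since the $\alpha_2$ terms cancel.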

Consider a control interval $m > 1$ with belief vectors $\pi_{m,1}$, $\pi_{m,2}$ and action $a_m$. If we can show for any $m$ that, assuming the greedy policy will be implemented in all the future control intervals, the greedy policy is optimal in control interval $m$, then using induction from interval 1, where greedy is indeed optimal, we could establish the long term optimality of the greedy policy. Let $\{a_k\}_{k \le m-1} = \{\hat a_k\}_{k \le m-1}$ and let $S_k$ be the state vector such that $S_k(i)$ is the state of the channel of user $i$ in interval $k$. We rewrite the net expected reward as follows

$$V_m(\pi_{m,1}, \pi_{m,2}, \{a_m, \{\hat a_k\}_{k \le m-1}\}) = \pi_{m,a_m}\alpha + \sum_{S_m} P_{S_m|\pi_{m,1},\pi_{m,2}}(S_m \mid \pi_{m,1}, \pi_{m,2})\, \hat V_{m-1}(S_m, a_{m-1}),$$

where $\hat V_{m-1}$ is the expected future reward conditioned on the state vector in control interval $m$. The hat on this quantity emphasizes the use of the greedy policy in all $k \le m-1$. $P_{S_m|\pi_{m,1},\pi_{m,2}}(S_m \mid \pi_{m,1}, \pi_{m,2})$ is the conditional probability of the current state vector $S_m$ given the belief vectors $\pi_{m,1}$, $\pi_{m,2}$. The scheduling decision in the next control interval, $a_{m-1}$, is based on the greedy policy and is a function of the ARQ feedback received in the current control interval $m$, i.e., $S_m(a_m)$. The decision logic was summarized in Proposition 5. We now proceed to compare the net expected reward when $a_m = 1$ and $a_m = 2$. The net expected reward when $a_m = 1$ is written as follows,

$$\begin{aligned}
V_m(\pi_{m,1}, \pi_{m,2}, \{a_m = 1, \{\hat a_k\}_{k \le m-1}\}) = \pi_{m,1}\alpha
&+ P_{S_m|\pi_{m,1},\pi_{m,2}}(S_m = [1\ 1] \mid \pi_{m,1}, \pi_{m,2})\, \hat V_{m-1}(S_m = [1\ 1], a_{m-1} = 2) \\
&+ P_{S_m|\pi_{m,1},\pi_{m,2}}(S_m = [1\ 2] \mid \pi_{m,1}, \pi_{m,2})\, \hat V_{m-1}(S_m = [1\ 2], a_{m-1} = 2) \\
&+ P_{S_m|\pi_{m,1},\pi_{m,2}}(S_m = [1\ 3] \mid \pi_{m,1}, \pi_{m,2})\, \hat V_{m-1}(S_m = [1\ 3], a_{m-1} = 2) \\
&+ P_{S_m|\pi_{m,1},\pi_{m,2}}(S_m = [2\ 1] \mid \pi_{m,1}, \pi_{m,2})\, \hat V_{m-1}(S_m = [2\ 1], a_{m-1} = 1) \\
&+ P_{S_m|\pi_{m,1},\pi_{m,2}}(S_m = [2\ 2] \mid \pi_{m,1}, \pi_{m,2})\, \hat V_{m-1}(S_m = [2\ 2], a_{m-1} = 1) \\
&+ P_{S_m|\pi_{m,1},\pi_{m,2}}(S_m = [2\ 3] \mid \pi_{m,1}, \pi_{m,2})\, \hat V_{m-1}(S_m = [2\ 3], a_{m-1} = 1) \\
&+ P_{S_m|\pi_{m,1},\pi_{m,2}}(S_m = [3\ 1] \mid \pi_{m,1}, \pi_{m,2})\, \hat V_{m-1}(S_m = [3\ 1], a_{m-1} = 1) \\
&+ P_{S_m|\pi_{m,1},\pi_{m,2}}(S_m = [3\ 2] \mid \pi_{m,1}, \pi_{m,2})\, \hat V_{m-1}(S_m = [3\ 2], a_{m-1} = 1) \\
&+ P_{S_m|\pi_{m,1},\pi_{m,2}}(S_m = [3\ 3] \mid \pi_{m,1}, \pi_{m,2})\, \hat V_{m-1}(S_m = [3\ 3], a_{m-1} = 1)
\end{aligned} \tag{20}$$

Note that the scheduler uses only the state of the scheduled user (user 1) in its scheduling decisions, consistent with the problem setup. Also note that when $S_m(1) = 2$, the schedule is retained. This is consistent with the implementation structure of the greedy policy seen in Proposition 5, where the scheduler retains the scheduling choice even when $F_2$ is received. As was discussed in the same proposition, this is a greedy decision only if a user was never dropped in the past for giving feedback $F_3$. Since we are restricting to the class of schedulers that retain the schedule when $F_3$ is received³, this is indeed a greedy decision. Since the Markov channel statistics are identical across the users, we have $\hat V_k(S_{k+1} = [x\ y], a_k = 1) = \hat V_k(S_{k+1} = [y\ x], a_k = 2)$. Expanding the net expected reward when $a_m = 2$ along the lines of (20) and using the preceding symmetry property, we have,

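$$\begin{aligned}
& V_m(\pi_{m,1}, \pi_{m,2}, \{a_m = 1, \{\hat a_k\}_{k \le m-1}\}) - V_m(\pi_{m,1}, \pi_{m,2}, \{a_m = 2, \{\hat a_k\}_{k \le m-1}\}) \\
&\quad = \pi_{m,1}\alpha - \pi_{m,2}\alpha + \left[\hat V_{m-1}(S_m = [3\ 2], a_{m-1} = 1) - \hat V_{m-1}(S_m = [2\ 3], a_{m-1} = 1)\right] \\
&\qquad \times \left[\pi_{m,1}(3)\,\pi_{m,2}(2) - \pi_{m,1}(2)\,\pi_{m,2}(3)\right]
\end{aligned} \tag{21}$$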
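The cancellation that leads from (20) to (21) can be verified numerically. The sketch below is illustrative only: it assumes that the two channels are independent, so that the probability of $S_m = [x\ y]$ is $\pi_{m,1}(x)\,\pi_{m,2}(y)$, it fills the future rewards with arbitrary random numbers, and it compares only the future-reward part of the difference (the immediate term $\pi_{m,1}\alpha - \pi_{m,2}\alpha$ is left out). All function names are hypothetical.

```python
import random


def future_reward(pi1, pi2, W, scheduled):
    """Future-reward sum in the style of (20) for two users.

    W[x][y] stands for V_hat_{m-1}(S_m = [x y], a_{m-1} = 1) with 0-based
    states; by the symmetry property, V_hat_{m-1}(S_m = [x y], a_{m-1} = 2)
    equals W[y][x].
    """
    total = 0.0
    for x in range(3):
        for y in range(3):
            prob = pi1[x] * pi2[y]          # assumption: independent channels
            seen = x if scheduled == 1 else y
            # Rule used in (20): switch users after F1, otherwise retain.
            keep = seen != 0
            next_user = scheduled if keep else 3 - scheduled
            total += prob * (W[x][y] if next_user == 1 else W[y][x])
    return total


def random_belief(rng):
    """Draw an arbitrary probability vector over the three states."""
    raw = [rng.random() for _ in range(3)]
    s = sum(raw)
    return [r / s for r in raw]


rng = random.Random(0)
pi1, pi2 = random_belief(rng), random_belief(rng)
W = [[rng.random() for _ in range(3)] for _ in range(3)]

lhs = future_reward(pi1, pi2, W, 1) - future_reward(pi1, pi2, W, 2)
# Right-hand side of (21) without the immediate-reward term.
rhs = (W[2][1] - W[1][2]) * (pi1[2] * pi2[1] - pi1[1] * pi2[2])
print(abs(lhs - rhs) < 1e-12)
```

Seven of the nine state pairs cancel between the two expansions; only $[2\ 3]$ and $[3\ 2]$ survive, which is why (21) contains a single product.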



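Let $\hat a_m$ indicate the greedy choice among the users in the current control interval, i.e., $\hat a_m = \arg\max_{i \in \{1,2\}} R_m(\pi_{m,i})$. Let $\bar a_m$ indicate the other user. The difference in net expected reward can now be rewritten as,

$$\begin{aligned}
& V_m(\pi_{m,1}, \pi_{m,2}, \{a_m = \hat a_m, \{\hat a_k\}_{k \le m-1}\}) - V_m(\pi_{m,1}, \pi_{m,2}, \{a_m = \bar a_m, \{\hat a_k\}_{k \le m-1}\}) \\
&\quad = \pi_{m,\hat a_m}\alpha - \pi_{m,\bar a_m}\alpha + \left[\hat V_{m-1}(S_m = [3\ 2], a_{m-1} = 1) - \hat V_{m-1}(S_m = [2\ 3], a_{m-1} = 1)\right] \\
&\qquad \times \left[\pi_{m,\hat a_m}(3)\,\pi_{m,\bar a_m}(2) - \pi_{m,\hat a_m}(2)\,\pi_{m,\bar a_m}(3)\right]
\end{aligned} \tag{22}$$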



3 This is the only instance in the proof where we constrain the search space.

where, by definition, $\pi_{m,\hat a_m}\alpha \ge \pi_{m,\bar a_m}\alpha$. We now proceed to show that the quantity $\hat V_{m-1}(S_m = [3\ 2], a_{m-1} = 1) - \hat V_{m-1}(S_m = [2\ 3], a_{m-1} = 1)$ is non-negative. With $\hat V_k(S_{k+1} = [x\ y]) := \hat V_k(S_{k+1} = [x\ y], a_k = 1)$, and expanding $\hat V_{m-1}(S_m = [x\ y])$ along the lines of (20) with $\pi_{m-1,1} = p_x$, $\pi_{m-1,2} = p_y$ and $a_{m-1} = 1$, we have the following.

$$\hat V_{m-1}(S_m = [3\ 2]) - \hat V_{m-1}(S_m = [2\ 3]) = p_3\alpha - p_2\alpha + \left[\hat V_{m-2}(S_{m-1} = [3\ 2]) - \hat V_{m-2}(S_{m-1} = [2\ 3])\right] (p_{33}\, p_{22} - p_{23}\, p_{32}) \tag{23}$$
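Recursion (23) can be iterated directly to see how the gap between the two future rewards evolves over the horizon. The sketch below uses a hypothetical matrix `P` and an assumed reward vector $\alpha = [0\ 0.5\ 1]^T$, and it further assumes that the one-step rewards satisfy $r_i = p_i\alpha$, so that the starting value $r_3 - r_2$ equals $p_3\alpha - p_2\alpha$; none of these values come from the paper.

```python
# Hypothetical transition matrix and reward vector (alpha_2 = 0.5 is assumed).
P = [[0.6, 0.3, 0.1],
     [0.3, 0.3, 0.4],
     [0.1, 0.3, 0.6]]
ALPHA = [0.0, 0.5, 1.0]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


p2, p3 = P[1], P[2]
gain = dot(p3, ALPHA) - dot(p2, ALPHA)          # p_3 alpha - p_2 alpha
coeff = p3[2] * p2[1] - p2[2] * p3[1]           # p_33 p_22 - p_23 p_32


def gap_sequence(steps):
    """Iterate D_k = gain + coeff * D_{k-1} as in (23).

    D_1 = r_3 - r_2 is assumed here to equal p_3 alpha - p_2 alpha.
    """
    gaps = [gain]
    for _ in range(steps - 1):
        gaps.append(gain + coeff * gaps[-1])
    return gaps


gaps = gap_sequence(10)
print(coeff >= 0, all(g >= 0 for g in gaps), round(gaps[-1], 4))
```

Because both the additive term and the multiplier are non-negative in this example, every iterate stays non-negative, which is the pattern the induction relies on.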



By the property of the $P$ matrix, $p_{33} \ge p_{23}$ and $p_{22} \ge p_{32}$. Also, we have seen in Lemma 3 that $r_3 > r_2 > r_1$. Thus if $\hat V_{m-2}(S_{m-1} = [3\ 2]) - \hat V_{m-2}(S_{m-1} = [2\ 3]) \ge 0$, then $\hat V_{m-1}(S_m = [3\ 2]) - \hat V_{m-1}(S_m = [2\ 3]) \ge 0$. Expanding $\hat V_{m-2}(S_{m-1} = [3\ 2]) - \hat V_{m-2}(S_{m-1} = [2\ 3])$ along the lines of (23) repeatedly and using $\hat V_1(S_2 = [3\ 2]) - \hat V_1(S_2 = [2\ 3]) = r_3 - r_2 > 0$, by induction, we can show that $\hat V_{m-2}(S_{m-1} = [3\ 2]) - \hat V_{m-2}(S_{m-1} = [2\ 3]) \ge 0$. Thus $\hat V_{m-1}(S_m = [3\ 2]) - \hat V_{m-1}(S_m = [2\ 3]) \ge 0$. Applying this inequality in (22), we see that the optimality of the greedy policy (in the specified class of policies) can be established if we show that the following condition (condition (S)) holds:

$$\pi_{m,\hat a_m}(3)\,\pi_{m,\bar a_m}(2) \ge \pi_{m,\hat a_m}(2)\,\pi_{m,\bar a_m}(3). \tag{24}$$

It appears that the preceding condition is too generic to hold true. However, by constraining the belief vectors to the set of values that will be encountered in the ARQ based scheduling problem, we will now show that (24) indeed holds true.

We first introduce the following result: From Lemma 10, $p_2 P^{k} [0\,0\,1]^T$ monotonically decreases to $p_{ss} [0\,0\,1]^T = p_{ss}(3)$ as $k$ increases. Since $p_2 P^{k} [0\,1\,0]^T = p_{22} = p_{ss}(2)$, the expected reward from a user given that the channel of the user was in state 2 $k + 1$ intervals earlier, given by $p_2 P^{k}\alpha = \alpha(2)\, p_{ss}(2) + p_2 P^{k} [0\,0\,1]^T$, monotonically decreases to $p_{ss}\alpha$.

We proceed with studying the sufficient condition under various belief vectors encountered in the ARQ based scheduling problem. Assume the scheduling process has begun in a control interval earlier than $m$ and is performed uninterrupted till the horizon, i.e., control interval 1; we call this assumption (A)⁴. The belief vectors of the greedy choice $\hat a_m$ and the other user $\bar a_m$, for the type I system under consideration, fall under one of the following cases.

4 Note that there is no loss of generality in this assumption for the following reason: The problem setup and the optimality analysis of any policy implicitly assume uninterrupted scheduling until the horizon. This is to be in tune with the interval to interval evolution of the underlying Markov chains. Thus when the uninterrupted scheduling process begins at a control interval $M$, for all $m < M$ condition (A) is satisfied automatically. In the control interval $M$, however, part of the condition, i.e., that the scheduling process began earlier, does not hold. But at the origin, i.e., the control interval $M$, the belief vectors of all the users take the steady state value, $p_{ss}$. Thus, by symmetry, the question of what scheduling decision to make and hence the question of the optimality of the greedy policy at $M$ becomes irrelevant.

• 1. User $\hat a_m$ was scheduled in the previous control interval, $m + 1$, and had given a feedback $F_3$. The belief vector $\pi_{m,\hat a_m} = p_3$. The other user was either scheduled $k + 1$ control intervals earlier (with $k \in \{1, 2, \ldots\}$) with any of the three possible feedbacks or was never scheduled in the past. Thus the belief vector of $\bar a_m$ is of the form $p_i P^{k}$ with $i \in \{1, 2, 3\}$ and $k \in \{1, 2, \ldots\}$. Note that if $\bar a_m$ was never scheduled in the past, then $\pi_{m,\bar a_m} = p_{ss}$ which still falls in the preceding form.

• 2. User $\bar a_m$ was scheduled in the previous control interval and had given a feedback $F_1$. User $\hat a_m$ was either scheduled $k + 1$ control intervals earlier (with $k \in \{1, 2, \ldots\}$) with any of the three possible feedbacks or was never scheduled in the past. The belief vectors are given by $\pi_{m,\bar a_m} = p_1$ and $\pi_{m,\hat a_m} = p_i P^{k}$ with $i \in \{1, 2, 3\}$ and $k \in \{1, 2, \ldots\}$.

• 3. User $\hat a_m$ was scheduled in the previous control interval and had given a feedback $F_2$. User $\bar a_m$ was scheduled $k + 1$ control intervals earlier (with $k \in \{1, 2, \ldots\}$) with feedback $F_1$ or was never scheduled in the past. The belief vectors are given by $\pi_{m,\hat a_m} = p_2$, $\pi_{m,\bar a_m} = p_1 P^{k}$ with $k \in \{1, 2, \ldots\}$.

• 4. User $\hat a_m$ was scheduled in the previous control interval and had given a feedback $F_2$. User $\bar a_m$ was scheduled $k + 1$ control intervals earlier (with $k \in \{1, 2, \ldots\}$) with feedback $F_2$. The belief vectors are given by $\pi_{m,\hat a_m} = p_2$, $\pi_{m,\bar a_m} = p_2 P^{k}$ with $k \in \{1, 2, \ldots\}$.

• 5. User $\hat a_m$ was scheduled in the previous control interval and had given a feedback $F_2$. User $\bar a_m$ was scheduled $L + 1$ or more control intervals earlier with feedback $F_3$. $L$ is the number of coherence intervals such that the reward expected from a user that was observed to be in state 2 in the previous control interval is higher than the reward expected from a user that was observed in state 3 $k + 1$ control intervals earlier iff $k \ge L$. Mathematically, $L$ is such that,

$$p_2\alpha \ge p_3 P^{k}\alpha \ \text{ if } k \ge L, \qquad p_2\alpha < p_3 P^{k}\alpha \ \text{ if } k < L. \tag{25}$$

Note that such an $L$ exists since $p_2\alpha < p_3\alpha$ and both $p_2 P^{k}\alpha$ and $p_3 P^{k}\alpha$ monotonically decrease (with $k$) to $p_{ss}\alpha < p_2\alpha$. The belief vectors are hence given as $\pi_{m,\hat a_m} = p_2$, $\pi_{m,\bar a_m} = p_3 P^{k}$ with $k \ge L$.

• 6. User $\bar a_m$ was scheduled in the previous control interval and had given a feedback $F_2$. User $\hat a_m$ was scheduled $k + 1$ control intervals earlier with feedback $F_3$ with $k < L$. The belief vectors are as follows: $\pi_{m,\hat a_m} = p_3 P^{k}$ with $k < L$ and $\pi_{m,\bar a_m} = p_2$.

The above list is exhaustive. In fact, cases 5 and 6 will never appear since we are considering the class of schedulers that never drop a user when it sends an $F_3$. However, we will show that even for these cases the sufficient condition is satisfied. In all the above 6 cases, $R_m(\hat a_m) \ge R_m(\bar a_m)$ as required by the definition of $\hat a_m$. We now focus on the sufficient condition (S) for each of the above cases.
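The crossover index $L$ of (25) is easy to compute for a given chain. The sketch below is an illustration under stated assumptions: the matrix `P` and the reward vector $\alpha = [0\ 0.5\ 1]^T$ are hypothetical, and the search relies on $p_3 P^{k}\alpha$ decreasing monotonically in $k$, as noted in case 5, so that the first index at which the inequality holds is $L$.

```python
# Hypothetical transition matrix and reward vector (alpha_2 = 0.5 is assumed).
P = [[0.6, 0.3, 0.1],
     [0.3, 0.3, 0.4],
     [0.1, 0.3, 0.6]]
ALPHA = [0.0, 0.5, 1.0]


def vec_mat(v, M):
    """Multiply a row vector v by a matrix M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def crossover_interval(M, alpha, max_k=1000):
    """Smallest k with p_2 alpha >= p_3 M^k alpha, or None if not found.

    Equals L in (25) only when p_3 M^k alpha decreases monotonically in k.
    """
    target = dot(M[1], alpha)       # p_2 alpha
    belief = list(M[2])             # p_3 M^0
    for k in range(max_k):
        if dot(belief, alpha) <= target:
            return k
        belief = vec_mat(belief, M)
    return None


L = crossover_interval(P, ALPHA)
print(L)
```

For this hypothetical matrix the search returns $L = 4$: a user last seen in state 3 remains the better choice for a few intervals before a fresh observation of state 2 becomes more valuable.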

• 1. Sufficient condition (S) is given as follows:

$$\pi_{m,\hat a_m}(3)\,\pi_{m,\bar a_m}(2) \ge \pi_{m,\hat a_m}(2)\,\pi_{m,\bar a_m}(3)$$

$$\text{i.e., } p_{33}\, p_i P^{k} [0\,1\,0]^T \ge p_{32}\, p_i P^{k} [0\,0\,1]^T, \quad \forall i \in \{1, 2, 3\},\ k \in \{1, 2, \ldots\} \tag{26}$$

Since $p_{12} = p_{22} = p_{32}$, we have

$$p_i P^{k} [0\,1\,0]^T = p_{12} = p_{22} = p_{32} \quad \forall i \in \{1, 2, 3\},\ k \in \{1, 2, \ldots\} \tag{27}$$

Also, $p_i P^{k} [0\,0\,1]^T = p_i P^{k-1} P [0\,0\,1]^T = p_i P^{k-1} [p_{13}\ p_{23}\ p_{33}]^T \le p_{33}$, since $p_{33} \ge p_{23} \ge p_{13}$ by the property of the $P$ matrix. Thus (S) holds for case 1.

• 2. (S) is as follows: $p_i P^{k} [0\,0\,1]^T\, p_{12} \ge p_i P^{k} [0\,1\,0]^T\, p_{13}$, $\forall i \in \{1, 2, 3\}$, $k \in \{1, 2, \ldots\}$. From the symmetry property (27), $p_{12} = p_i P^{k} [0\,1\,0]^T$. Also since $p_{13} \le p_{23} \le p_{33}$ we can show $p_i P^{k} [0\,0\,1]^T \ge p_{13}$. Thus (S) is satisfied for case 2.

• 3. (S): $p_{23}\, p_1 P^{k} [0\,1\,0]^T \ge p_{22}\, p_1 P^{k} [0\,0\,1]^T$. From Lemma 11, $p_1 P^{k} [0\,0\,1]^T$ monotonically increases to $p_{ss}(3)$ as $k$ increases as $0, 1, 2, \ldots$. Since $p_{23} > p_{ss}(3)$ (using Lemma 9), we have $p_{23} > p_1 P^{k} [0\,0\,1]^T$. Also, $p_1 P^{k} [0\,1\,0]^T = p_{22}$ from the symmetry property in (27). Thus (S) holds for case 3.

• 4. (S): $p_{23}\, p_2 P^{k} [0\,1\,0]^T \ge p_{22}\, p_2 P^{k} [0\,0\,1]^T$. From Lemma 10, $p_2 P^{k} [0\,0\,1]^T$ monotonically decreases from $p_{23}$ to $p_{ss}(3)$ as $k$ increases as $0, 1, 2, \ldots$. Thus $p_{23} \ge p_2 P^{k} [0\,0\,1]^T$. This inequality along with the symmetry property (27) establishes (S) for case 4.

• 5. (S): $p_{23}\, p_3 P^{k} [0\,1\,0]^T \ge p_{22}\, p_3 P^{k} [0\,0\,1]^T$ with $k \ge L$. Note that for all $k \ge L$,

$$\begin{aligned}
& p_2\alpha \ge p_3 P^{k}\alpha \\
\Leftrightarrow\ & \alpha_2\, p_{22} + p_{23} \ge \alpha_2\, p_3 P^{k} [0\,1\,0]^T + p_3 P^{k} [0\,0\,1]^T \\
\Leftrightarrow\ & p_{23} \ge p_3 P^{k} [0\,0\,1]^T
\end{aligned} \tag{28}$$

where we have used the symmetry property $p_{22} = p_3 P^{k} [0\,1\,0]^T$ in obtaining the last inequality. (S) is established by using the symmetry property along with the preceding inequality.

• 6. (S): $p_3 P^{k} [0\,0\,1]^T\, p_{22} \ge p_3 P^{k} [0\,1\,0]^T\, p_{23}$ with $k < L$. For $k < L$, $p_2\alpha < p_3 P^{k}\alpha$. Expanding both sides along the lines of case 5 and using the symmetry property of (27), (S) can be established for case 6.
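Condition (S) can also be checked by brute force over the belief vectors of the six cases. The sketch below does this for one hypothetical matrix `P` satisfying (18) and (19), with an assumed reward vector $\alpha = [0\ 0.5\ 1]^T$ and a finite range of $k$; it illustrates the case analysis on a single example and is not a substitute for the proof. The helper names are illustrative.

```python
# Hypothetical transition matrix and reward vector (alpha_2 = 0.5 is assumed).
P = [[0.6, 0.3, 0.1],
     [0.3, 0.3, 0.4],
     [0.1, 0.3, 0.6]]
ALPHA = [0.0, 0.5, 1.0]
K_MAX = 20      # largest k examined; an arbitrary cut-off


def vec_mat(v, M):
    """Multiply a row vector v by a matrix M."""
    return [sum(v[i] * M[i][j] for i in range(3)) for j in range(3)]


def aged(row, k):
    """Belief vector row * P^k."""
    v = list(row)
    for _ in range(k):
        v = vec_mat(v, P)
    return v


def reward(v):
    return sum(a * b for a, b in zip(v, ALPHA))


def holds_S(hat, bar, tol=1e-12):
    """Condition (S): hat(3) * bar(2) >= hat(2) * bar(3), with 0-based indices."""
    return hat[2] * bar[1] >= hat[1] * bar[2] - tol


p1, p2, p3 = P
# Crossover index of (25), found by direct search.
L = next(k for k in range(K_MAX) if reward(p2) >= reward(aged(p3, k)))

ks = range(1, K_MAX + 1)
# Each entry is a list of (greedy belief, other belief) pairs for that case.
cases = {
    1: [(p3, aged(p, k)) for p in P for k in ks],
    2: [(aged(p, k), p1) for p in P for k in ks],
    3: [(p2, aged(p1, k)) for k in ks],
    4: [(p2, aged(p2, k)) for k in ks],
    5: [(p2, aged(p3, k)) for k in ks if k >= L],
    6: [(aged(p3, k), p2) for k in ks if k < L],
}
results = {c: all(holds_S(h, b) for h, b in pairs) for c, pairs in cases.items()}
print(L, results)
```

The small tolerance in `holds_S` only guards against floating-point rounding in the entries that are equal by the symmetry property (27).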

Thus the sufficient condition for the constrained search space optimality of the greedy
policy is satisfied.